“From the Medium artile, ‘Mistakes, we’ve drawn a few’’”
We’re going to re-create the graphs below from the Medium article, ‘Mistakes, we’ve drawn a few’. The author of this article claims the mistake with this particular graph below:
Mistake: Choosing the wrong visualisation method
Let’s take a closer look at the original graph with some of her comments:
“We published this polling chart in Espresso, our daily news app. It shows attitudes to the outcome of the EU referendum, plotted as a line chart. Looking at the data, it appears as if respondents had a rather erratic view of the referendum result — increasing and decreasing by a couple of percentage points from one week to the next.”
“Looking at the data, it appears as if respondents had a rather erratic view of the referendum result — increasing and decreasing by a couple of percentage points from one week to the next.”
We’re going to:
Below we load the necessary packages:
There are four data files for this article, which we import from the course GitHub repo below:
Balance dataThe first dataset we’ll investigate is the Balance data:
glimpse(Balance)
Rows: 266
Columns: 4
$ country <chr> "Belgium", "Germany", "Estonia", "Ireland", "Gr…
$ account_type <chr> "current", "current", "current", "current", "cu…
$ year <dbl> 2009, 2009, 2009, 2009, 2009, 2009, 2009, 2009,…
$ value <dbl> -3755.0, 141234.0, 360.0, -7912.1, -29323.0, -4…
We can see this data contains values per country in the EU, grouped by year and some kind of account_type. These clearly aren’t the data from the original and better graphs above, so we won’t be diving deeper into each variable.
But we can use the Balance dataset to create a column-graph of value (on the x) and country (on the y).
sign variableBecause value contains both positive and negative numbers, we’ll create an identifier for the sign.
Balance %>%
mutate(sign =
if_else(condition = value > 0,
true = "Positive Balance",
false = "Negative Balance"))
# A tibble: 266 × 5
country account_type year value sign
<chr> <chr> <dbl> <dbl> <chr>
1 Belgium current 2009 -3755 Negative Balance
2 Germany current 2009 141234 Positive Balance
3 Estonia current 2009 360 Positive Balance
4 Ireland current 2009 -7912. Negative Balance
5 Greece current 2009 -29323 Negative Balance
6 Spain current 2009 -46191 Negative Balance
7 France current 2009 -10652 Negative Balance
8 Italy current 2009 -29717 Negative Balance
9 Cyprus current 2009 -1431 Negative Balance
10 Latvia current 2009 1463 Positive Balance
# … with 256 more rows
A great way to check your work when creating new categorical variables is passing the output to janitor::tabyl(). In this case, we want to see the countries broken down by our new sign variable.
Balance %>%
mutate(sign =
if_else(condition = value > 0,
true = "Positive Balance",
false = "Negative Balance")) %>%
janitor::tabyl(country, sign)
country Negative Balance Positive Balance
Austria 7 7
Belgium 13 1
Cyprus 14 0
Estonia 4 10
Finland 12 2
France 14 0
Germany 5 9
Greece 14 0
Ireland 11 3
Italy 11 3
Latvia 12 2
Lithuania 11 3
Luxembourg 2 12
Malta 11 3
Netherlands 7 7
Portugal 11 3
Slovakia 11 3
Slovenia 9 5
Spain 11 3
We’ll create a canvas (or initialized plot) with value on the x, country on the y, grouped and filled by sign.
balance_global_layer <- Balance %>%
mutate(sign =
if_else(condition = value > 0,
true = "Positive Balance",
false = "Negative Balance")) %>%
ggplot2::ggplot(aes(
x = value,
y = country,
group = sign,
fill = sign
))
balance_global_layer
Now that we have a blank canvas, we’ll build a column-graph using geom_col(). We know we won’t need a legend, to we’ll set this to FALSE.
balance_global_layer +
geom_col(show.legend = FALSE)
This graph is stacking the columns, which is the default position in geom_col(). We can also use "dodge" or "dodge2":
balance_global_layer +
geom_col(position = "dodge", show.legend = FALSE)
balance_global_layer +
geom_col(position = "dodge2", show.legend = FALSE)
We’ll stick with position = "dodge", and call this balance_geom_col_layer
balance_geom_col_layer <- balance_global_layer +
geom_col(position = "dodge",
show.legend = FALSE)
We’ll add some color to denote the positive ("green4") and negative ("firebrick") values from sign, and fix the x axis with scale_x_continuous() and scales::comma.
This layer gets saved as balance_x_scale_layer.
balance_x_scale_layer <- balance_geom_col_layer +
scale_fill_manual(values = c("firebrick", "green4")) +
scale_x_continuous(labels = scales::comma)
balance_x_scale_layer
We have to other categorical variables in the Balance data, account_type and year. Let’s check how these look with facet_wrap()
balance_x_scale_layer +
facet_wrap(~ account_type)
balance_x_scale_layer +
facet_wrap(~ year)
Both of these would work–it just depends on which category is best at illustrating our argument. We’ll arbitrarily choose the account_type for balance_facet_wrap_layer.
balance_facet_wrap_layer <- balance_x_scale_layer +
facet_wrap(~ account_type)
balance_facet_wrap_layer
Finally–we can add a theme and labels. We’ll use ggplot2::theme_minimal().
balance_facet_wrap_layer +
ggplot2::theme_minimal() +
labs(title = "Balance Data (from EU referendum)",
subtitle = "Country values (positive or negative) by account type")
Corbyn dataThe Corbyn is below:
| political_group | avg_facebook_likes |
|---|---|
| Jeremy Corbyn | 5210 |
| Labour Party | 845 |
| Momentum | 229 |
| Owen Smith | 127 |
| Andy Burnham | 105 |
| Saving Labour | 56 |
There isn’t much data in Corbyn, but we’ll look at the distribution of avg_facebook_likes by political_group using geom_col():
Corbyn %>%
ggplot(aes(x = political_group, y = avg_facebook_likes)) +
geom_col(aes(fill = political_group))
This is more color categories than we’d like for a graph, but these aren’t the data we’re trying to correct, so we’ll move on.
PensionsBelow are the Pensions data:
glimpse(Pensions)
Rows: 35
Columns: 3
$ country <chr> "Australia", "Austria", "Belgium", "Br…
$ pop_65_percent <dbl> 15.04, 18.76, 18.22, 7.84, 16.14, 11.0…
$ gov_spend_percent_gdp <dbl> 5.20, 13.86, 10.36, 12.00, 4.31, 3.25,…
We can see a single categorical variable, country, and two percentages (pop_65_percent and gov_spend_percent_gdp).
We’ll use geom_boxplot() to graph the pop_65_percent and gov_spend_percent_gdp variables (note we use y = "" in aes() and in the labs() function to remove the y)
Pensions %>%
# the variable
ggplot(
aes(
x = pop_65_percent,
y = "")) +
geom_boxplot() +
labs(y = "")
Pensions %>%
# the variable
ggplot2::ggplot(aes(
x = gov_spend_percent_gdp,
y = "")) +
geom_boxplot() +
labs(y = "")
Box-plots are handy if we have a categorical variable we want to view distributions across, too.
If we wanted to show how pop_65_percent changes with the gov_spend_percent_gdp, we could get the top 5 countries with the gov_spend_percent_gdp, then order our y axis by the pop_65_percent, but set the size to gov_spend_percent_gdp (I’ve multiplied it by 10 here just to make the points a little larger):
Pensions %>%
arrange(desc(gov_spend_percent_gdp)) %>%
group_by(country) %>%
slice_max(gov_spend_percent_gdp) %>%
head(5) %>%
ggplot(
aes(x = pop_65_percent,
y = fct_reorder(.f = country, .x = pop_65_percent),
color = country)) +
geom_point(aes(size = gov_spend_percent_gdp*10), show.legend = FALSE) +
labs(y = "Country", x = "Pop 65 percent")
Again–these aren’t the data we’re trying to fix, so we’ll move on.
Brexit dataThe Brexit data are below:
glimpse(Brexit)
Rows: 85
Columns: 3
$ date <chr> "2/8/16", "9/8/16", "17/08/16", "23…
$ percent_responding_right <dbl> 46, 45, 46, 45, 47, 46, 45, 45, 46,…
$ percent_responding_wrong <dbl> 42, 44, 43, 43, 44, 43, 44, 44, 43,…
These are the data from the graph, so we’ll break each step down layer-by-layer.
dateFirst we need a properly formatted date
Rows: 85
Columns: 3
$ date <date> 2016-08-02, 2016-08-09, 2016-08-17…
$ percent_responding_right <dbl> 46, 45, 46, 45, 47, 46, 45, 45, 46,…
$ percent_responding_wrong <dbl> 42, 44, 43, 43, 44, 43, 44, 44, 43,…
We’ll start with some line-graphs of date vs. percent_responding_right and percent_responding_wrong
Place date (on the x) and percent_responding_right (on the y)
Brexit %>%
ggplot(aes(x = date,
y = percent_responding_right)) +
geom_line(size = 2.7, color = "cornflowerblue")
Plot date (on the x) and percent_responding_wrong (on the y)
Brexit %>%
ggplot(aes(x = date,
y = percent_responding_wrong)) +
geom_line(size = 2.7, color = "firebrick")
We want both percent metrics in the same column, indexed on poll. The code below will assign the output from tidyr::pivot_longer() to TidyBrexit:
TidyBrexit <- Brexit %>%
pivot_longer(-date,
names_to = "poll",
values_to = "percent"
) %>%
mutate(
poll = str_replace_all(poll, "_", " ")
)
glimpse(TidyBrexit)
Rows: 170
Columns: 3
$ date <date> 2016-08-02, 2016-08-02, 2016-08-09, 2016-08-09, 201…
$ poll <chr> "percent responding right", "percent responding wron…
$ percent <dbl> 46, 42, 45, 44, 46, 43, 45, 43, 47, 44, 46, 43, 45, …
Now we pass the pivoted data to geom_line() and view the results.
TidyBrexit %>%
ggplot2::ggplot(aes(
x = date, y = percent,
group = poll, color = poll)) +
geom_line(size = 2.7) +
scale_color_manual(values = c("cornflowerblue", "firebrick3")) +
theme_minimal() +
labs(title = "Bremorse",
subtitle = "In hindsight, do you think Britain was right or wrong to leave the EU\n% responding",
x = "", y = "")
This is pretty close to the Original (minus the realigned y axis, text labels, and reformatted x axis).
Brexit graphNow we’ll create the ‘Better’ graph, and make some of our own ‘improvements.’
Use TidyBrexit to build the ggp_brexit plot object layer with date on the x, percent on the y, and group assigned to poll:
ggp_brexit_global_layer <- TidyBrexit %>%
ggplot(mapping = aes(x = date, y = percent, group = poll))
ggp_brexit_global_layer
ggplot2::geom_point()Add the points with ggplot2::geom_point(), but include the show.legend = FALSE, size = 2.5, and alpha = 1/3. Save this as ggp_brexit_point_layer
ggp_brexit_point_layer <- ggp_brexit_global_layer +
geom_point(aes(color = poll),
show.legend = FALSE, size = 2.5, alpha = 1 / 3)
ggp_brexit_point_layer
geom_smooth()Add the ggplot2::geom_smooth() to ggp_brexit_point_layer
ggp_brexit_point_layer +
ggplot2::geom_smooth()
color aethetic to pollggp_brexit_point_layer +
ggplot2::geom_smooth(aes(color = poll))
se = FALSEAdd the se = FALSE to remove the gray area around the lines, and remove the legend with show.legend = FALSE and create the ggp_brexit_smooth_layer object.
ggp_brexit_smooth_layer <- ggp_brexit_point_layer +
geom_smooth(aes(color = poll),
se = FALSE, show.legend = FALSE)
ggp_brexit_smooth_layer
scale_color_manual()Add the colors cornflowerblue and firebrick3 to the ggp_brexit_smooth_layer layer and create a new layer called ggp_brexit_colors_layer.
ggp_brexit_colors_layer <- ggp_brexit_smooth_layer +
scale_color_manual(values = c("cornflowerblue", "firebrick3"))
ggp_brexit_colors_layer
Use the ggplot2::geom_text() layer to add the text to the ggp_brexit_colors_layer layer, and name the new plot ggp_brexit_annotate_text_layer
ggp_brexit_annotate_text_layer <- ggp_brexit_colors_layer +
annotate(geom = "text",
label = "Wrong",
size = 6,
color = "firebrick3",
x = as.Date("2018-01-01"),
y = 45.5
) +
annotate(geom = "text",
label = "Right",
size = 6,
color = "cornflowerblue",
x = as.Date("2018-01-01"),
y = 42
)
ggp_brexit_annotate_text_layer
y axisMove the y axis to the right side of the graph with ggplot2::scale_y_continuous()
ggp_brexit_y_scale_layer <- ggp_brexit_annotate_text_layer +
ggplot2::scale_y_continuous(position = "right")
ggp_brexit_y_scale_layer
Use the ggplot2::labs() function to create the labels layer.
labs_eco <- labs(
title = "Bremorse",
subtitle = "In hindsight, do you think Britain was right or wrong to leave the EU\n% responding",
x = "", y = "")
Add the labels layer to ggp_brexit_y_scale_layer and create a new plot titled, ggp_brexit_labs.
ggp_brexit_labs <- ggp_brexit_y_scale_layer + labs_eco
ggp_brexit_labs
Add the ggthemes::theme_economist_white() layer and set the following arguments:
gray_bg = FALSEbase_size = 12base_family = "Verdana"ggp_brexit_labs +
ggthemes::theme_economist_white(
gray_bg = FALSE,
base_size = 12,
base_family = "Verdana"
)